Deliver synchronous fault signals to the faulting thread by Max042004 · Pull Request #86 · sysprog21/elfuse

Max042004 · 2026-06-06T13:53:02Z

Note

Stacked on #76. This branch is based on fix-cross-process-signal-el0 (#76),
whose commit appears here until #76 merges; afterwards this PR auto-reduces to the
single fault-delivery commit. Review #76 first.

A guest exception (SIGSEGV/SIGBUS/SIGILL/SIGFPE/SIGTRAP) was delivered via
signal_set_fault_info() (which records si_code/si_addr in a _Thread_local slot)
followed by signal_queue() + signal_deliver(). But signal_queue() sets a bit in
the process-wide pending mask, and every vCPU thread runs signal_deliver() at its own
poll points, so a fault raised by thread A could be dequeued and delivered by thread B —
whose thread-local fault info is empty. The siginfo then degraded to si_code=SI_USER
with no si_addr instead of SEGV_MAPERR/si_addr. A JVM treats such a SIGSEGV as a
fatal external signal rather than a recoverable implicit null check, so javac crashed.
Two threads faulting on the same signal also collapsed into one bitmask bit, dropping
one fault.

Add signal_deliver_fault(), which delivers a given signal directly to the current
thread using its thread-local fault info and never touches the process-wide pending set.
signal_deliver()'s frame-building tail is factored into deliver_signal_locked() and
shared. All four synchronous-fault sites in the vCPU loop (SEGV_ACCERR, BRK→SIGTRAP,
SIGILL, data-abort SIGSEGV) route through it.

Adds test-fault-signal-mt: N threads each take recoverable SIGSEGVs and assert every
delivery is SEGV_MAPERR with the right si_addr on the faulting thread. It fails
deterministically (rc=2) on the old queue-based path and passes on the fix.

Summary by cubic

Deliver synchronous fault signals to the faulting vCPU thread to keep correct si_code/si_addr and avoid lost faults. Fixes JVM crashes caused by SIGSEGV delivered as SI_USER and makes multi-threaded faults reliable.

Bug Fixes
- Added signal_deliver_fault() and routed all synchronous guest exceptions through it; bypasses the process-wide pending mask.
- Factored handler frame build into deliver_signal_locked(); supports EL0 preemption by redirecting via HV_REG_PC/CPSR.
- Updated vcpu_run_loop to use the new path and to treat HV_EXIT_REASON_UNKNOWN like CANCELED so preempted runs resume cleanly.
- Added test-fault-signal-mt to verify per-thread SIGSEGV delivery; Makefile and manifest updated.

^{Written for commit 23b2307. Summary will update on new commits.}

test-fork's phase-2 signal child spins in `while (!got_usr1) usleep()` waiting for a SIGUSR1 sent cross-process by its parent. The signal was delivered only ~35% of the time and lost the rest, so `make check` hung at test-fork until the 60s per-test timeout -- often longer, as leaked `elfuse --fork-child` orphans kept the driver's stdout pipe open. Two complementary defects: 1. vcpu_run_loop treated HV_EXIT_REASON_UNKNOWN as fatal. A host SIGUSR2 (the cross-process guest-signal transport) that interrupts hv_vcpu_run mid-execution aborts the run with UNKNOWN rather than the clean CANCELED that hv_vcpus_exit() produces for a vCPU caught between runs. Route UNKNOWN through the same cancellation handling so the already-queued guest signal is delivered instead of crashing the child. 2. signal_deliver redirected to the handler only via ELR_EL1, which takes effect solely on an ERET from EL1 (the syscall-return path, gated by the shim's X8==2 exec_drop_frame marker). When the signal is delivered from the cancellation branch -- i.e. the vCPU was preempted while running EL0 code (cross-process SIGUSR2, or SIGALRM in a tight loop) -- there is no pending ERET, the resume uses HV_REG_PC, and the ELR_EL1 write is a no-op: the handler never runs and only the X0=signum clobber lands, re-running the interrupted nanosleep with a bogus arg and spinning forever. Detect EL0 preemption from the live PSTATE (CPSR M[3:0]==0), save the interrupted PC from HV_REG_PC instead of the stale ELR_EL1, and redirect HV_REG_PC/CPSR directly; skip the X8==2 marker since there is no shim frame to drop. test-fork now passes 20/20 (was ~7/20); `make check` is green with no hang.

A guest exception (SIGSEGV/SIGBUS/SIGILL/SIGFPE/SIGTRAP) was delivered by calling signal_set_fault_info() (which records si_code/si_addr in a _Thread_local slot) followed by signal_queue() + signal_deliver(). But signal_queue() sets a bit in the process-wide pending mask, and every vCPU thread runs signal_deliver() at its own poll points. So a fault raised by thread A could be dequeued and delivered by thread B, whose thread-local fault info is empty -- the siginfo then degrades to si_code=SI_USER with no si_addr instead of SEGV_MAPERR/si_addr. A JVM treats a SIGSEGV with si_code=SI_USER as a fatal external signal rather than a recoverable implicit null check, so javac (heavily multi-threaded, many null-check faults) crashed with a fatal error report. Two threads faulting on the same signal also collapsed into a single bitmask bit, dropping one fault entirely. Add signal_deliver_fault(), which delivers a given signal directly to the current thread using its thread-local fault info and never touches the process-wide pending set. signal_deliver()'s frame-building tail is factored into deliver_signal_locked() and shared. Route all four synchronous-fault sites in the vCPU loop (SEGV_ACCERR, BRK->SIGTRAP, SIGILL, data-abort SIGSEGV) through it. Add test-fault-signal-mt: N threads each take recoverable SIGSEGVs in a loop and assert every delivery is SEGV_MAPERR with the right si_addr on the faulting thread. It fails deterministically (rc=2, fault delivered to a non-faulting thread) on the old queue-based path and passes on the fix. (cherry picked from commit 201b4fa0d2ffb2218e252be71930e1cfb9e360a5)

jserv · 2026-06-06T16:43:27Z

+     */
+    pthread_mutex_lock(&sig_lock);
+    signal_rt_info_t rt_info = signal_default_info(signum);
+    return deliver_signal_locked(vcpu, g, signum, rt_info, exit_code);


The comment two lines up calls this "intentionally ignored: a synchronous fault cannot be postponed" -- that's only half the Linux contract. force_sig_info_to_task in kernel/signal.c resets the disposition to SIG_DFL and unblocks the signum before applying default, so a SIGSEGV/SIGBUS/SIGILL/SIGFPE/SIGTRAP that's blocked OR set to SIG_IGN terminates the process. This PR's behavior diverges in two ways:

SIG_IGN on synchronous fault (preexisting): deliver_signal_locked at signal.c:1352 returns 0 for SIG_IGN. The vCPU loop resumes, PC unchanged, the same instruction re-faults forever. Linux would terminate.

Blocked synchronous fault (new in this PR): pre-PR, signal_deliver's deliverable = sig_state.pending & ~*blocked short-circuited and the thread re-faulted infinitely. Post-PR, signal_deliver_fault bypasses the blocked mask entirely and runs the user handler despite the block. Linux would reset to SIG_DFL + unblock + terminate.

JVM works either way because it never blocks SIGSEGV, but the contract drift is real. A five-line precheck closes both cases with one shape:

int signal_deliver_fault(hv_vcpu_t vcpu, guest_t *g, int signum, int *exit_code) { pthread_mutex_lock(&sig_lock); uint64_t *blocked = thread_blocked_ptr(); linux_sigaction_t *act = &sig_state.actions[signum - 1]; if (act->sa_handler == LINUX_SIG_IGN || (*blocked & sig_bit(signum))) { /* Linux force_sig_info_to_task: forced synchronous faults reset * disposition to SIG_DFL, unblock the signum, then apply default. */ act->sa_handler = LINUX_SIG_DFL; *blocked &= ~sig_bit(signum); } signal_rt_info_t rt_info = signal_default_info(signum); return deliver_signal_locked(vcpu, g, signum, rt_info, exit_code); }

Update the "intentionally ignored" comment accordingly.

jserv

Rebase onto the latest main branch, resolve any merge conflicts, and refine the changes based on the review feedback.

cubic-dev-ai

1 issue found across 6 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/syscall/proc.c">

<violation number="1" location="src/syscall/proc.c:1948">
P1: When UNKNOWN exits are treated as preemption, rseq abort can read a stale PC and fail to redirect to `abort_ip`, leaving a preempted rseq critical section to resume incorrectly. Consider syncing ELR/SPSR from `HV_REG_PC/HV_REG_CPSR` (or using `HV_REG_PC` directly) on UNKNOWN before the `rseq_try_abort()` path.</violation>
</file>

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

cubic-dev-ai · 2026-07-02T16:43:16Z

            }
-        } else if (vexit->reason == HV_EXIT_REASON_CANCELED) {
+        } else if (vexit->reason == HV_EXIT_REASON_CANCELED ||
+                   vexit->reason == HV_EXIT_REASON_UNKNOWN) {


P1: When UNKNOWN exits are treated as preemption, rseq abort can read a stale PC and fail to redirect to abort_ip, leaving a preempted rseq critical section to resume incorrectly. Consider syncing ELR/SPSR from HV_REG_PC/HV_REG_CPSR (or using HV_REG_PC directly) on UNKNOWN before the rseq_try_abort() path.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At src/syscall/proc.c, line 1948: <comment>When UNKNOWN exits are treated as preemption, rseq abort can read a stale PC and fail to redirect to `abort_ip`, leaving a preempted rseq critical section to resume incorrectly. Consider syncing ELR/SPSR from `HV_REG_PC/HV_REG_CPSR` (or using `HV_REG_PC` directly) on UNKNOWN before the `rseq_try_abort()` path.</comment> <file context> @@ -1941,11 +1944,22 @@ int vcpu_run_loop(hv_vcpu_t vcpu, } - } else if (vexit->reason == HV_EXIT_REASON_CANCELED) { + } else if (vexit->reason == HV_EXIT_REASON_CANCELED || + vexit->reason == HV_EXIT_REASON_UNKNOWN) { /* Canceled by hv_vcpus_exit(). Can be: alarm timeout, * exit_group from another thread, or signal preemption </file context>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Deliver synchronous fault signals to the faulting thread#86

Deliver synchronous fault signals to the faulting thread#86
Max042004 wants to merge 2 commits into
sysprog21:mainfrom
Max042004:fix-sync-fault-signal

Max042004 commented Jun 6, 2026 •

edited by cubic-dev-ai Bot

Loading

Uh oh!

This comment was marked as resolved.

Uh oh!

jserv Jun 6, 2026

Uh oh!

This comment was marked as duplicate.

Uh oh!

jserv left a comment

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

cubic-dev-ai Bot Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Max042004 commented Jun 6, 2026 • edited by cubic-dev-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by cubic

Uh oh!

This comment was marked as resolved.

Uh oh!

jserv Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

This comment was marked as duplicate.

Uh oh!

jserv left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Max042004 commented Jun 6, 2026 •

edited by cubic-dev-ai Bot

Loading